Automatic document filing according to user categories

Authors

  • Dong CAO
  • Judith GELERNTER
  • Jaime CARBONELL
Abstract

Suppose you have just acquired a number of articles for a personal digital library. The classification procedures described here would allow those articles to be sorted automatically into your pre-existing desktop folders. We use the Chi-square statistic to select the keywords used to classify the articles, and a Support Vector Classifier to assign the articles to folders. Moreover, we have adapted the process so that minimal user feedback should improve classification results, and so that folders do not accumulate too many articles to manage conveniently. We achieved an average of 96% accuracy over 10 trials with the Reuters-21578 news data, and 89% accuracy over 10 trials with biomedical literature downloaded from the PubMed digital library. We conclude that in a personal desktop environment, whether general or highly specific, our classification methods for automatic filing into the user's own familiar folders would achieve high accuracy.
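The abstract names the Chi-square statistic for keyword selection but gives no formula or implementation. As a rough illustration only, the following sketch uses the standard 2x2 term/category form of the statistic to rank candidate keywords for one folder. The function names (chi_square, top_keywords), the whitespace tokenization, and the choice to keep only terms positively associated with the folder are assumptions made for this example, not details taken from the paper. The Support Vector Classifier step is not shown.

```python
from collections import defaultdict


def chi_square(a, b, c, d):
    """Chi-square statistic for a 2x2 term/category contingency table.

    a: documents in the category that contain the term
    b: documents outside the category that contain the term
    c: documents in the category that lack the term
    d: documents outside the category that lack the term
    """
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    if denom == 0:
        # Degenerate table (an empty row or column): no evidence either way.
        return 0.0
    return n * (a * d - c * b) ** 2 / denom


def top_keywords(docs, labels, category, k=3):
    """Return the k terms with the highest chi-square score for `category`.

    Assumption for this sketch: terms are lowercase whitespace tokens, and
    only terms that occur more often inside the category than expected
    (a*d > c*b) are kept, since the raw statistic is symmetric.
    """
    in_cat = sum(1 for y in labels if y == category)
    out_cat = len(labels) - in_cat

    # Document frequencies of each term inside and outside the category.
    df_in = defaultdict(int)
    df_out = defaultdict(int)
    for text, y in zip(docs, labels):
        for term in set(text.lower().split()):
            if y == category:
                df_in[term] += 1
            else:
                df_out[term] += 1

    scores = {}
    for term in set(df_in) | set(df_out):
        a, b = df_in[term], df_out[term]
        c, d = in_cat - a, out_cat - b
        if a * d > c * b:  # keep positively associated terms only
            scores[term] = chi_square(a, b, c, d)

    # Sort by descending score, breaking ties alphabetically.
    return sorted(scores, key=lambda t: (-scores[t], t))[:k]
```

In a pipeline of the kind the abstract describes, the selected keywords would presumably define the feature vector passed to the classifier. How many keywords to keep per folder is a tuning choice that the abstract does not specify.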


Similar resources

RRLUFF: Ranking function based on Reinforcement Learning using User Feedback and Web Document Features

The principal aim of a search engine is to provide sorted results according to the user's requirements. To achieve this aim, it employs ranking methods to rank web documents based on their significance and relevance to the user's query. The novelty of this paper is a user feedback-based ranking algorithm using reinforcement learning. The proposed algorithm is called RRLUFF, in which the rank...


Modelspace - Cooperative Document Information Extraction in Flexible Hierarchies

Business document indexing for ordered filing of documents is a crucial task for every company. Since this is tedious, error-prone work, automatic or at least semi-automatic approaches have a high value. One approach for semi-automated indexing of business documents uses self-learning information extraction methods based on user feedback. While these methods require no management of complex in...


Introduction Jonathan McElroy November 6, 2009

One of the main goals of Computer Science is to increase the ease of Human-Computer Interaction by creating systems that assist users by allowing them to perform tasks more intuitively. With the continuous increase in computer complexity, speed and disk space in recent years, combined with the increase in usage of digital documents, there has been a greater increase in the desire to...


A survey on Automatic Text Summarization

Text summarization endeavors to produce a summary version of a text, while maintaining the original ideas. The textual content on the web, in particular, is growing at an exponential rate. The ability to decipher through such massive amount of data, in order to extract the useful information, is a major undertaking and requires an automatic mechanism to aid with the extant repository of informa...


Supporting user-subjective categorization with self-organizing maps and learning vector quantization

we requested the user to reclassify documents that were misclassified by the system. Results show that despite the subjective nature of human categorization, automatic document categorization methods correlate well with subjective, personal categorization, and the LVQ method outperforms the SOM. The reclassification process revealed an interesting pattern: About 40% of the documents were classi...



Publication date: 2010